[CASCL-1304] Snapshot node managers in ConfigMap during install #2945
Conversation
Once Karpenter is installed on an existing EKS cluster, the user must migrate workloads off the existing nodes. The right migration action depends on how each node is currently managed (Fargate / Karpenter / EKS managed node group / EC2 ASG / standalone) and whether a legacy cluster-autoscaler is still running. Capture that topology in a `dd-cluster-info` ConfigMap so the follow-up migration step has a ground truth to drive from. Errors are logged as a warning and do not fail `install`: the snapshot is informational, and Karpenter is already up by the time it runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run `make update-golang` to align test/e2e/go.mod and go.sum with the indirect aws-sdk-go-v2 / smithy-go patch bumps pulled in by `go work sync`. Required by the GitLab `check-golang-version` job. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Here is the output of the manual e2e test:

$ kubectl -n dd-karpenter get cm/dd-cluster-info -o yaml
kind: ConfigMap
apiVersion: v1
data:
  cluster-info: |
    apiVersion: v1
    clusterName: lenaic-karpenter-test
    generatedAt: 2026-04-27T14:08:30.1565563Z
    nodeManagement:
      asg:
        eksctl-lenaic-karpenter-test-nodegroup-ng-bar-NodeGroup-awxH2qriZxFV:
        - ip-10-11-233-63.eu-west-3.compute.internal
      eksManagedNodeGroup:
        ng:
        - ip-10-11-232-138.eu-west-3.compute.internal
        - ip-10-11-234-250.eu-west-3.compute.internal
        ng-foo:
        - ip-10-11-234-249.eu-west-3.compute.internal
      fargate:
        "":
        - fargate-ip-10-11-234-123.eu-west-3.compute.internal
        - fargate-ip-10-11-234-246.eu-west-3.compute.internal
      karpenter:
        dd-karpenter-3qdyk:
        - ip-10-11-234-209.eu-west-3.compute.internal
        - ip-10-11-234-23.eu-west-3.compute.internal
        - ip-10-11-234-253.eu-west-3.compute.internal
        - ip-10-11-234-37.eu-west-3.compute.internal
    clusterAutoscaler:
      present: false
metadata:
  creationTimestamp: "2026-04-27T14:08:30Z"
  labels:
    app.kubernetes.io/managed-by: kubectl-datadog
  name: dd-cluster-info
  namespace: dd-karpenter
  resourceVersion: "1321418"
  uid: 18e5aec6-2a90-4c5d-96b4-57b2100521a9
The new clusterinfo package imports `github.com/aws/aws-sdk-go-v2/service/autoscaling` for DescribeAutoScalingInstances batching, so the third-party license manifest needs the matching entry. Patch matches the diff produced by the CI's `make verify-licenses` step verbatim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codex review
🎯 Code Coverage (Commit SHA: ec85702)
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2945      +/-   ##
==========================================
+ Coverage   40.91%   41.12%   +0.21%
==========================================
  Files         324      326       +2
  Lines       28743    28894     +151
==========================================
+ Hits        11760    11883     +123
- Misses      16129    16150      +21
- Partials      854      861       +7
Flags with carried forward coverage won't be shown.
... and 1 file with indirect coverage changes
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6a5a25b67c
	return err
}

if err = recordClusterInfo(ctx, cli, clusterName, karpenterNamespace); err != nil {
Snapshot before creating installer-managed nodes
In the default --install-mode=fargate path, installHelmChart has already waited for the Karpenter controller before this call, and the Fargate stack selects the Karpenter namespace, so EKS can already have created the new dd-karpenter-<cluster> Fargate node/profile by the time the snapshot is taken. That makes the ConfigMap include installer-created Karpenter infrastructure as if it were pre-existing topology, and a migration reader using this snapshot to retire legacy Fargate profiles could target the profile that keeps Karpenter running; take the snapshot before installing/waiting for Karpenter or filter the installer-created profile.
Good catch — fixed in 234ffa6.
Classify now runs before createCloudFormationStacks, so the Fargate profile and the controller node it eventually backs are not yet observable from the cluster. Persist stays at the end, where the Karpenter namespace exists. Both calls remain best-effort (warnings on failure).
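For reference, a minimal sketch of that "best-effort (warnings on failure)" behavior; the wrapper name and logger are illustrative, not code from this PR:

```go
package install

import (
	"context"
	"log"
)

// bestEffort runs an informational step and downgrades any error to a warning,
// so a failed cluster-info snapshot never fails the install command itself.
func bestEffort(ctx context.Context, step string, fn func(context.Context) error) {
	if err := fn(ctx); err != nil {
		log.Printf("warning: %s failed, continuing install: %v", step, err)
	}
}
```

Both snapshot calls (classify and persist) would go through a wrapper like this, while the CloudFormation and Helm steps keep returning their errors.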
Force-pushed from 234ffa6 to 6a5a25b.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6a5a25b67c
	},
}

return commonk8s.CreateOrUpdate(ctx, cli, cm)
Preserve the first cluster-info snapshot
When install is run again after a successful Fargate install, this CreateOrUpdate overwrites the existing dd-cluster-info ConfigMap. Since the Karpenter namespace/Fargate profile and controller node already exist before a rerun starts, the original pre-install topology can be replaced with installer-created nodes, so the follow-up migration may treat the dd-karpenter-* profile as legacy infrastructure. If the ConfigMap already exists, keep the original snapshot or filter installer-owned entries instead of updating it.
Contesting this one. The "preserve the first snapshot" approach has the same fundamental problem in the more common scenario where the user re-runs install after a partial failure (IAM hiccup, kubeconfig issue, transient AWS error, …): the "first" snapshot would freeze whatever state existed during that partially-successful first run, which may itself include installer-managed nodes — and the user would then be stuck with a wrong snapshot they cannot refresh.
Both alternatives — snapshot-before-install (#234ffa64, since reverted) and preserve-first-snapshot — buy correctness on a perfectly clean first run at the cost of locking the snapshot in cases where iteration is needed. We are accepting the late-snapshot trade-off knowingly: the migration consumer will learn to skip Karpenter-managed nodes in the dd-karpenter-* Fargate profile and the karpenter bucket. Tracking that filter in the migration tool is a one-liner and survives reruns.
Not actioning.
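For what that migration-side filter could look like, a hypothetical sketch (the bucket names come from the e2e output above and the `NodeManagement` map shape from the struct below; nothing here is code from this PR):

```go
package migration

import "strings"

// NodeManager mirrors the key type of the NodeManagement map persisted in
// dd-cluster-info; values such as "karpenter" and "fargate" are taken from
// the e2e output earlier in this thread.
type NodeManager string

// legacyNodes drops installer-owned entries from a snapshot: everything in
// the "karpenter" bucket, plus Fargate profiles named dd-karpenter-*.
func legacyNodes(nm map[NodeManager]map[string][]string) map[NodeManager]map[string][]string {
	out := make(map[NodeManager]map[string][]string, len(nm))
	for manager, groups := range nm {
		if manager == "karpenter" {
			continue
		}
		kept := make(map[string][]string, len(groups))
		for group, nodes := range groups {
			if manager == "fargate" && strings.HasPrefix(group, "dd-karpenter-") {
				continue
			}
			kept[group] = nodes
		}
		if len(kept) > 0 {
			out[manager] = kept
		}
	}
	return out
}
```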
// ClusterAutoscaler captures whether a legacy cluster-autoscaler Deployment
// is running and where, so the migration can warn the user to stop it before
// scaling EKS managed node groups (per the Karpenter migration guide).
type ClusterAutoscaler struct {
nit: maybe capture the version too.
Done in ec85702. The version comes from the image tag of the matching container (the source of truth), with app.kubernetes.io/version on the Deployment / pod template as a fallback for digest-only image references. The field is omitted from the YAML when neither signal is available.
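A sketch of that fallback order; the container-image match on "cluster-autoscaler" and the helper name are assumptions for illustration, not the exact detection code in ec85702:

```go
package clusterinfo

import (
	"strings"

	appsv1 "k8s.io/api/apps/v1"
)

// caVersion returns the cluster-autoscaler version: image tag of the matching
// container first, then the app.kubernetes.io/version label on the pod
// template or the Deployment. An empty result means the field is omitted.
func caVersion(d *appsv1.Deployment) string {
	for _, c := range d.Spec.Template.Spec.Containers {
		if !strings.Contains(c.Image, "cluster-autoscaler") {
			continue
		}
		ref := c.Image
		if at := strings.Index(ref, "@"); at != -1 {
			ref = ref[:at] // a digest is not a version; keep only the tag part
		}
		if colon := strings.LastIndex(ref, ":"); colon > strings.LastIndex(ref, "/") {
			return ref[colon+1:] // image tag is the source of truth
		}
		break // matching container is digest-only: fall back to labels
	}
	if v := d.Spec.Template.Labels["app.kubernetes.io/version"]; v != "" {
		return v
	}
	return d.Labels["app.kubernetes.io/version"]
}
```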
ClusterName string `yaml:"clusterName"`
GeneratedAt time.Time `yaml:"generatedAt"`
NodeManagement map[NodeManager]map[string][]string `yaml:"nodeManagement"`
ClusterAutoscaler ClusterAutoscaler `yaml:"clusterAutoscaler"`
Should we also collect whether karpenter or karpenter-auto-mode was already present?
We already collect whether some nodes are managed by Karpenter, so we would only miss the fact that Karpenter is up and running if it hasn't had to create any node yet.
May I suggest adding the presence of a detected Karpenter in a follow-up PR? I put the logic to detect a "foreign" Karpenter (not installed by kubectl datadog) in another dedicated PR: #2949.
Per review feedback: the migration consumer needs to know which CA version is running so it can branch on known-bad versions or surface deprecation guidance. Capture the version, preferring the image tag of the matching container (the source of truth) and falling back to the `app.kubernetes.io/version` label set by most Helm charts. The field is omitted from the YAML when neither signal is available (e.g. an image referenced by digest only and no version label). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ec857027f1
/merge
View all feedbacks in Devflow UI.
The merge request has been interrupted because the build took longer than expected. The current limit for the base branch 'main' is 120 minutes.
What does this PR do?
At the very end of `kubectl datadog autoscaling cluster install`, classify every node by its current management method (Fargate / Karpenter / EKS managed node group / EC2 ASG / standalone / unknown) and persist the snapshot as YAML in a `dd-cluster-info` ConfigMap in the Karpenter namespace. The command also detects whether a legacy `cluster-autoscaler` Deployment is running (and where) and records it as a sibling field in the same ConfigMap.

The classification follows the decision tree documented in the Confluence page EKS Node Management Identification Guide — Signals, Tags and Datadog Visibility.
ConfigMap shape: see the `dd-cluster-info` YAML in the e2e test output earlier in this thread (a `nodeManagement` map keyed by manager and group, plus a sibling `clusterAutoscaler` field), and the classification sketch below.
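For illustration, a trimmed sketch of the per-node part of that decision tree. The label keys and the node-name prefix below are common EKS conventions assumed for this sketch, not necessarily the exact signals from the Confluence guide, and the asg / standalone / unknown split needs the separate EC2/ASG lookup:

```go
package clusterinfo

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// classifyNode buckets a node by the signals visible on the Node object alone.
// Nodes that match none of these fall through to the ASG lookup, which decides
// between asg, standalone, and unknown.
func classifyNode(node corev1.Node) string {
	switch {
	case strings.HasPrefix(node.Name, "fargate-"):
		return "fargate"
	case node.Labels["karpenter.sh/nodepool"] != "":
		return "karpenter"
	case node.Labels["eks.amazonaws.com/nodegroup"] != "":
		return "eksManagedNodeGroup"
	default:
		return "unknown"
	}
}
```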
Motivation
Once Karpenter is up, the user has to migrate workloads off the existing nodes onto the new Karpenter-managed ones. The right migration action depends on how each existing node is currently managed (scale the EKS managed node group to 0, delete the Fargate profile, stop the legacy `cluster-autoscaler` first, etc.). Today the install command finishes without leaving any record of the cluster's pre-Karpenter topology, forcing a follow-up migration tool (or the user) to re-discover it from scratch.

This PR persists the topology in a ConfigMap so the upcoming migration step has a ground truth to drive from. Tracked in CASCL-1304 under epic CASCL-920.
Additional Notes
- `autoscaling:DescribeAutoScalingInstances` (batched at the documented 50-ID per call limit); see the batching sketch after these notes.
- `cmd/kubectl-datadog/autoscaling/cluster/common/clusterinfo` placed under `common/` (not `install/`) so a future `uninstall`-side reader does not need to import the install package.
- The `clusterAutoscaler` field is a sibling of `nodeManagement` so future cluster-level metadata can be added without breaking the schema (`apiVersion: v1` is set explicitly to allow versioned evolution).
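A sketch of the 50-ID batching mentioned in the first note; the helper name and wiring are illustrative, only the SDK call and the per-call limit come from the notes above:

```go
package clusterinfo

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling/types"
)

// describeASGInstances looks up ASG membership for the given EC2 instance IDs,
// batching DescribeAutoScalingInstances at 50 IDs per call (75 nodes -> 50+25).
func describeASGInstances(ctx context.Context, cli *autoscaling.Client, instanceIDs []string) ([]types.AutoScalingInstanceDetails, error) {
	const maxIDsPerCall = 50
	var out []types.AutoScalingInstanceDetails
	for start := 0; start < len(instanceIDs); start += maxIDsPerCall {
		end := start + maxIDsPerCall
		if end > len(instanceIDs) {
			end = len(instanceIDs)
		}
		resp, err := cli.DescribeAutoScalingInstances(ctx, &autoscaling.DescribeAutoScalingInstancesInput{
			InstanceIds: instanceIDs[start:end],
		})
		if err != nil {
			return nil, err
		}
		out = append(out, resp.AutoScalingInstances...)
	}
	return out, nil
}
```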
Minimum Agent Versions
Are there minimum versions of the Datadog Agent and/or Cluster Agent required?
Describe your test plan
- Unit tests in `cmd/kubectl-datadog/autoscaling/cluster/common/clusterinfo/` cover the empty-cluster, all-six-buckets, ASG-batching (75 nodes → 2 batches of 50+25), cluster-autoscaler detection by name / `app.kubernetes.io/name` label / `k8s-app` label / container-image substring, scaled-to-zero CA, and `Persist` create/update round-trips. Run with `go test ./cmd/kubectl-datadog/autoscaling/cluster/common/clusterinfo/...`.
- Manual e2e test (ConfigMap output posted in the conversation above), cross-checked against `kubectl get nodes -o wide` and `aws eks list-nodegroups`.
Checklist
- Labels: `bug`, `enhancement`, `refactoring`, `documentation`, `tooling`, and/or `dependencies`
- `qa/skip-qa` label